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Practitioners of Bayesian statistics have long depended on Markov 
chain Monte Carlo (MCMC) to obtain samples from intractable pos¬ 
terior distributions. Unfortunately, MCMC algorithms are typically 
serial, and do not scale to the large datasets typical of modern ma¬ 
chine learning. The recently proposed consensus Monte Carlo algo¬ 
rithm removes this limitation by partitioning the data and drawing 
samples conditional on each partition in parallel [22]. A fixed aggre¬ 
gation function then combines these samples, yielding approximate 
posterior samples. We introduce variational consensus Monte Carlo 
(VCMC), a variational Bayes algorithm that optimizes over aggrega¬ 
tion functions to obtain samples from a distribution that better ap¬ 
proximates the target. The resulting objective contains an intractable 
entropy term; we therefore derive a relaxation of the objective and 
show that the relaxed problem is blockwise concave under mild con¬ 
ditions. We illustrate the advantages of our algorithm on three in¬ 
ference tasks from the literature, demonstrating both the superior 
quality of the posterior approximation and the moderate overhead 
of the optimization step. Our algorithm achieves a relative error re¬ 
duction (measured against serial MCMC) of up to 39% compared 
to consensus Monte Carlo on the task of estimating 300-dimensional 
probit regression parameter expectations; similarly, it achieves an er¬ 
ror reduction of 92% on the task of estimating cluster comembership 
probabilities in a Gaussian mixture model with 8 components in 8 
dimensions. Furthermore, these gains come at moderate cost com¬ 
pared to the runtime of serial MCMC—achieving near-ideal speedup 
in some instances. 


1. Introduction. Modern statistical inference demands scalability to massive datasets and 
high-dimensional models. Innovation in distribnted and stochastic optimization has enabled pa¬ 
rameter estimation in this setting, e.g. via stochastic [3] and asynchronous [20] variants of gradient 
descent. Achieving similar success in Bayesian inference - where the target is a posterior distribution 
over parameter values, rather than a point estimate - remains computationally challenging. 

Two dominant approaches to Bayesian computation are variational Bayes and Markov chain 
Monte Carlo (MCMC). Within the former, scalable algorithms like stochastic variational infer¬ 
ence [11] and streaming variational Bayes [4] have successfully imported ideas from optimization. 
Within MCMC, adaptive subsampling procedures [2, 14], stochastic gradient Langevin dynam¬ 
ics [25], and Firefly Monte Carlo [16] have applied similar ideas, achieving computational gains by 
operating only on data subsets. These algorithms are serial, however, and thus cannot take advan¬ 
tage of multicore and multi-machine architectures. This motivates data-parallel MCMC algorithms 
such as asynchronous variants of Gibbs sampling [1, 8, 12]. 

Our work belongs to a class of communieation-avoiding data-parallel MCMC algorithms. These 
algorithms partition the full dataset Xi-j^ into K disjoint subsets where Xj^ denotes the data 

associated with core k. Each core samples from a subposterior distribution, 

Pk {Ok) oc p {Xi^ I Ok) p {Ok)^^^ , 


( 1 ) 
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and then a centralized procedure combines the samples into an approximation of the full posterior. 
Due to their efficiency, such procedures have recently received substantial attention [18, 22, 24]. 

One of these algorithms, consensus Monte Carlo (CMC), requires communication only at the 
start and end of sampling [22]. CMC proceeds from the intuition that subposterior samples, when 
aggregated correctly, can approximate full posterior samples. This is formally backed by the fac¬ 
torization 

K K 

(2) P {0 I xi-n) ocp{9)Y\p {Xi^ \0) = '[\pk {9). 

k=l k=l 

If one can approximate the subposterior densities pk, using kernel density estimates for instance [18], 
it is therefore possible to recombine them into an estimate of the full posterior. 

Unfortunately, the factorization does not make it immediately clear how to aggregate on the 
level of samples without first having to obtain an estimate of the densities pk themselves. CMC 
alters (2) to untie the parameters across partitions and plug in a deterministic link F from the 9^ 
to 9: 

K 

(3) p {9 I xi-,n) ~ n Pfc {9k) • S9=F(9u-,eK)- 

k=l 

This approximation and an aggregation function motivated by a Gaussian approximation lie at the 
core of the CMC algorithm [22]. 

The introduction of CMC raises numerous interesting questions whose answers are essential 
to its wider application. Two among these stand out as particularly vital. First, how should the 
aggregation function be chosen to achieve the closest possible approximation to the target posterior? 
Second, when model parameters exhibit structure or must conform to constraints — if they are, 
for example, positive semidefinite covariance matrices or labeled centers of clusters — how can the 
weighted averaging strategy of Scott et al. [22] be modified to account for this structure? 

In this paper, we propose variational consensus Monte Carlo (VCMC), a novel class of data- 
parallel MCMC algorithms that allow both questions to be addressed. By formulating the choice of 
aggregation function as a variational Bayes problem, VCMC makes it possible to adaptively choose 
the aggregation function to achieve a closer approximation to the true posterior. The flexibility 
of VCMC likewise supports nonlinear aggregation functions, and we exploit this versatility to 
introduce structured aggregation functions applicable to inference problems that are not purely 
vectorial. 

An appealing benefit of the VCMC point of view is a clarification of the untying step leading 
to (3). In VCMC, the approximate factorization corresponds to a variational approximation to the 
true posterior. This approximation can be viewed as the joint distribution of {9i,... ,9k) and 9 
in an augmented model that assumes conditional independence between the data partitions and 
posits a deterministic mapping from partition-level parameters to the single global parameter. 
The added flexibility of this point-of-view makes it possible to move beyond subposteriors and 
include alternative forms of (3) within the CMC framework. In particular, it is possible to define 
Pk{Gk) = p{(^k)p{Xij, \9k), using partial posteriors in place of subposteriors (cf. [23]). Although 
extensive investigation of this issue is beyond the scope of this paper, we provide some evidence in 
Section 6 that partial posteriors are a better choice in some circumstances and demonstrate that 
VCMC can provide substantial gains in both the partial posterior and subposterior settings. 

Before proceeding, we outline the remainder of this paper. Below, in §2, we review CMC and 
related data-parallel MCMC algorithms. Next, we cast CMC as a variational Bayes problem in §3. 
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We define the variational optimization objective in §4, addressing the challenging entropy term 
by relaxing it to a concave lower bound, and give conditions for which this leads to a blockwise 
concave maximization problem. In §5, we define several classes of aggregation functions, and design 
novel classes that enable structured aggregation of constrained and structured samples—e.g. positive 
semidefinite matrices and mixture model parameters. In §6, we evaluate the performance of VCMC 
and CMC relative to serial MCMC. We replicate experiments carried out by Scott et al. [22] and 
execute more challenging experiments in higher dimensions and with more data. Finally in §7, we 
summarize our approach and discuss several open problems generated by this work. 

2. Related work. We focus on data-parallel MCMC algorithms for large-scale Bayesian pos¬ 
terior sampling. Several recent research threads propose schemes in the setting where the posterior 
factors as in (2). In general, these parallel strategies are approximate relative to serial procedures, 
and the specific algorithms differ in terms of the approximations employed and amount of commu¬ 
nication required. 

At one end of the communication spectrum are algorithms that fit into the MapReduce model [7] . 
First, K parallel cores sample from K subposteriors, defined in (1), via any Monte Carlo sampling 
procedure. The subposterior samples are then aggregated to obtain approximate samples from the 
full posterior. This leads to the challenge of designing proper and efficient aggregation procedures. 

Scott et al. [22] propose consensus Monte Carlo (CMC), which constructs approximate posterior 
samples via weighted averages of subposterior samples; our algorithms are motivated by this work. 
Let denote the t-th subposterior sample from core k. In CMC, the aggregation function averages 
across each set of K samples produce one approximate posterior sample 9t. Uniform 

averaging is a natural but naive heuristic that can in fact be improved upon via a weighted average, 

K 

(4) 6 = F{ei..K) = Y.Wk9k, 

k=l 

where in general, 9k is a vector and Wk can be a matrix. The authors derive weights motivated 
by the special case of a Gaussian posterior, where each subposterior is consequently also Gaussian. 
Let Sfc be the covariance of the /s-th subposterior. This suggests weights Wk = equal to the 
subposteriors’ inverse covariances. CMC treats arbitrary subpostertiors as Caussians, aggregating 
with weights given by empirical estimates of computed from the observed subposterior samples. 

Neiswanger et al. [18] propose aggregation at the level of distributions rather than samples. 
Here, the idea is to form an approximate posterior via a product of density estimates fit to each 
subposterior, and then sample from this approximate posterior. The accuracy and computational 
requirements of this approach depend on the complexity of these density estimates. Wang and Dun- 
son [24] develop alternate data-parallel MCMC methods based on applying a Weierstrass transform 
to each subposterior. These Weierstrass sampling procedures introduce auxiliary variables and ad¬ 
ditional communication between computational cores. 

3. Consensus Monte Carlo as variational inference. Given the distributional form of the 
CMC framework (3), we would like to choose F so that the induced distribution on 9 is as close as 
possible to the true posterior. This is precisely the problem addressed by variational Bayes, which 
approximates an intractable posterior p{9 \ X) by the solution q* to the constrained optimization 
problem 

minDKL (q II p(' I X)) subject to q G Q, 
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where Q is the family of variational approximations to the distribution, usually chosen to make both 
optimization and evaluation of target expectations tractable. We thus view the aggregation problem 
in CMC as a variational inference problem, with the variational family given by all distributions 
Q = Qjr = {qp: F G F}, where each F is in some function class F and defines a density 

K 

QF {G) = j TT Pk (Ok) ■ (50=F(6ii,...,e^) d^i:^^:. 

k=i 

In practice, we parameterize by a finite dimensional set of parameters and optimize over it using 
projected stochastic gradient descent (SGD). 


4. The variational optimization problem. Standard optimization of the variational Bayes 
objective uses the evidence lower bound (ELBO) 


Pie, X) 


QiO) 


> Eq 


log 


Pie, X) 


QiO) 


logp(X) = log Eg 

(5) = logp (X) - Dkl (g 11 p (• I X)) =: £vb iq) ■ 

We can therefore recast the variational optimization problem in an equivalent form as 

maxTvB iq) subject to q ^ Q. 


Unfortunately, the variational Bayes objective £vb remains difficult to optimize. Indeed, by 
writing 

£vB((?)=Eq[logp(0, X)]+H[(?] 

we see that optimizing £vb requires computing an entropy H [g] and its gradients. We can deal 
with this issue by deriving a lower bound on the entropy that relaxes the objective further. 

Concretely, suppose that every F G F can be decomposed as F iOi-x) = Ylk=i^kiOk), with 
each Fk a differentiable bijection. Since the 0^ come from subposteriors conditioning on different 
segments of the data, they are independent. The entropy power inequality [6] therefore implies 


H [o'] > max H [Fk (^fc)] 


max (H \pk] + Ep^^ [logdet [J (Efc) (0^)]]) 

l<k<K 


(6) > min H [pfc] + max Ep^^ [log det [J (Ffc) (0^)]] 

l<k<K l<k<K 

I ^ 

(7) > min H [pfc] + — ^ Ep^ [log det [J (Fk) ((9^)]] =: H [g] , 

l<k<K A ^^ 


where J (/) (0) denotes the Jacobian of the function / evaluated at 0. The proof of this inequality 
can be found in the supplement. 

This approach gives an explicit, easily computed approximation to the entropy—and this approx¬ 
imation is a lower bound, allowing us to interpret it simply as a further relaxation of the original 
inference problem. Furthermore, and crucially, it decouples pk and F^, thereby making it possible 
to optimize over F^ without estimating the entropy of any pk- We note additionally that if we are 
willing to sacrifice concavity, we can use the tighter lower bound on the entropy given by (6). 

Putting everything together, we can define our relaxed variational objective as 


( 8 ) 


£(g) = Eq[logp(0, X)]+H[g]. 


Maximizing this function is the variational Bayes problem we consider in the remainder of the 
paper. 
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Conditions for concavity. Under certain conditions, the problem posed above is blockwise concave. 
To see when this holds, we use the language of graphical models and exponential families. To derive 
the result in the greatest possible generality, we decompose the variational objective as 

£vb = E, [logp (0, X)] + H [g] > £ + H [q] 

and prove concavity directly for C, then treat our choice of relaxed entropy (7). We emphasize that 
while the entropy relaxation is only defined for decomposed aggregation functions, concavity of the 
partial objective holds for arbitrary aggregation functions. All proofs are in the supplement. 

Suppose the model distribution is specified via a graphical model G, so that 9 = i9'^)u&v{G)^ 
such that each conditional distribution is defined by an exponential family 

logp I = log/l“ (r) + _ 

li'Gpar(u) 

If each of these log conditional density functions is log-concave in we can guarantee that the 
log likelihood is concave in each 0“ individually. 

Theorem 4.1 (Blockwise concavity of the variational cross-entropy). Suppose that the model 
distribution is specified by a graphical model G in whieh eaeh conditional probability density is 
a log-eoncave exponential family. Suppose further that the variational aggregation funetion family 
satisfies T = OneElG) such that we can decompose each aggregation function across nodes via 

F(0) = (F“(nW(G)> FeT and e 

If each is a convex subset of some vector space T-F, then the variational cross-entropy L is 
concave in each individually. 

Assuming that the aggregation function can be decomposed into a sum over functions of indi¬ 
vidual subposterior terms we can also prove concavity of our entropy relaxation (7). 

Theorem 4.2 (Concavity of the relaxed entropy). Suppose F = each function 

F ^ F decomposing as F {9i,..., 9 k) = {9k) for unique bijective Fk € Fk- Then the relaxed 

entropy (7) is concave in F. 

As a result, we derive concavity of the variational objective in a broad range of settings. 

Corollary 4.1 (Concavity of the variational objective). Under the hypotheses of Theorems 
4.1 and 4-2, the variational Bayes objective C = C + IB is concave in each individually. 

5. Variational aggregation function families. The performance of our algorithm depends 
critically on the choice of aggregation function family F. The family must be sufficiently simple 
to support efficient optimization, expressive to capture the complex transformation from the set of 
subposteriors to the full posterior, and structured to preserve structure in the parameters. We now 
illustrate some aggregation functions that meet these criteria. 

5.1. Vector aggregation. In the simplest case, 0 G is an unconstrained vector. Then, a linear 
aggregation function Fw = Ylk=i ^k^k makes sense, and it is natural to impose constraints to make 
this sum behave like a weighted average—i.e., each Wk G Sf is a positive semidefinite (PSD) matrix 
and ^k = Id- For computational reasons, it is often desirable to restrict to diagonal Wk. 
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5.2. Spectral aggregation. Cases involving structure exhibit more interesting behavior. Indeed, 

if our parameter is a PSD matrix A G 51^, applying the vector aggregation function above to 
the flattened vector form vec (A) of the parameter does not suffice. Denoting elementwise matrix 
product as o, we note that this strategy would in general lead to Fw (Ai:m) = ° ^ Sf. 

We therefore introduce a more sophisticated aggregation function that preserves PSD structure. 
For this, given symmetric A G define R (A) and D (A) to be orthogonal and diagonal matri¬ 

ces, respectively, such that A = R (A)^ D (A) R (A). Impose further—and crucially—the canonical 
ordering D (A)^^ > • • • > D (A)^^. We can then define our spectral aggregation function by 

K 

(Ai,^) = Y,R (Rkf [WkD (Afc)] R (Afc) . 

k=l 

Assuming G 51^, the output of this function is guaranteed to be PSD, as required. As above we 
restrict the set of Wk to the matrix simplex : Wk G Ylk=i 

5.3. Combinatorial aggregation. Additional complexity arises with unidentifiable latent vari¬ 
ables and, more generally, models with multimodal posteriors. Since this class encompasses many 
popular algorithms in machine learning, including factor analysis, mixtures of Gaussians and multi¬ 
nomials, and latent Dirichlet allocation (LDA), we now show how our framework can accommodate 
them. 

For concreteness, suppose now that our model parameters are given by 0 G where L 

denotes the number of global latent variables (e.g. cluster centers). We introduce discrete alignment 
parameters Ok that indicate how latent variables associated with partitions map to global latent 
variables. Each ak is thus a one-to-one correspondence [L] —)> [L], with Oki denoting the index on 
worker core k of cluster center i. For fixed a, we then obtain the variational aggregation function 

. K .L 

Fa{ei:K) = (Y.Wkieka,,ie)) ■ 

\=1 ^ ^=1 

Optimization can then proceed in an alternating manner, switching between the alignments Ok 
and the weights Wk, or in a greedy manner, fixing the alignments at the start and optimizing the 
weight matrices. In practice, we do the latter, aligning using a simple heuristic objective O (a) = 
Ylk=2'^e=i “ ^ 1^112 ’ ^ke denotes the mean value of cluster center i on partition k. 

As O suggests, we set an = i. This is permitted because the global order is only determined up to a 
permutation. We find that minimizing O via the Hungarian algorithm [15] leads to good alignments. 

6. Empirical evaluation. We now evaluate VCMC on three inference problems, in a range of 
data and dimensionality conditions. In the vector parameter case, we compare directly to the simple 
weighting baselines corresponding to previous work on CMC [22]; in the other cases, we compare to 
structured analogues of these weighting schemes. Our experiments demonstrate the advantages of 
VCMC across the whole range of model dimensionality, data quantity, and availability of parallel 
resources. 

Baseline weight settings. Scott et al. [22] studied linear aggregation functions with fixed weights, 
(9) wr^ = ^-Id and Wf"^^(xdiag(Sfc)“\ 

corresponding to uniform averaging and Gaussian averaging, respectively, where denotes the 
standard empirical estimate of the covariance. These are our baselines for comparison. 
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Fig 1: High-dimensional probit regression [d = 300). Moment approximation error for the uniform 
and Gaussian averaging baselines and VCMC, relative to serial MCMC, for (left) subposteriors and 
(right) partial posteriors. We assessed three groups of functions: first moments, with /(/3) = (dj 
for 1 < j < d; pure second moments, with /(/3) = for 1 < j < d; and mixed second moments, 
with /(/?) = (itfij for 1 < i < j < d. For brevity, results for pure second moments are relegated to 
Figure 6 in the supplement. Relative errors are truncated to 2 in all cases. 


Evaluation metrics. Since the goal of MCMC is usually to estimate event probabilities and function 
expectations, we evaluate algorithm accuracy for such estimates, relative to serial MCMC output. 
For each model, we consider a suite of test functions f & T (e.g. low degree polynomials, cluster 
comembership indicators), and we assess the error of each algorithm A using the metric 

/ .X _ \^A [/] ~ IEmcmc [/]| 

IIEmcmc [/]| 

In the body of the paper, we report median values of e^, computed within each test function class. 
The supplement expands on this further, showing quartiles for the differences in eycMC and ecMC- 


6.1. Bayesian probit regression. We consider the nonconjugate probit regression model. In this 
case, we use linear aggregation functions as our function class. For computational efficiency, we also 
limit ourselves to diagonal W^. We use Gibbs sampling on the following augmented model: 


/3~AA(0, fj^/rf) 
Zn I /3, ~ M'i/3'^Xn, 1) 


I Zji, (3, Xj^ 


1 if > 0, 

0 otherwise. 


This augmentation allows us to implement an efficient and rapidly mixing Gibbs sampler, where 


/3 I XUAT = X 

= z^M (SX^z, S) 

S= (cT-2/rf + X^X)"\ 

We run two experiments: the first using a data generating distribution from Scott et al. [22], with 
N = 8500 data points and d = 5 dimensions, and the second using N = 10^ data points and d = 300 
dimensions. As shown in Figure 1 and, in the supplement,^ Figures 5 and 6, VCMC decreases the 

^Due to space constraints, we relegate results for d = 5 to the supplement. 
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Fig 2: High-dimensional normal-inverse Wishart model {d = 100). (Far left, left, right) Moment 
approximation error for the uniform and Gaussian averaging baselines and VCMC, relative to serial 
MCMC. Letting pj denote the largest eigenvalue of A“^, we assessed three groups of functions: 
first moments, with /(A) = pj for 1 < j < d; pure second moments, with /(A) = for 1 < j < d; 
and mixed second moments, with /(A) = pipj for 1 < i < j < d. (Far right) Graph of error in 
estimating E \pj\ as a function of j (where pi> P 2 > ■ ■ ■ > Pd)- 


error of moment estimation compared to the baselines, with substantial gains starting at AT = 25 
partitions (and increasing with K). We also run the high-dimensional experiment using partial 
posteriors [23] in place of subposteriors, and observe substantially lower errors in this case. 

6.2. Normal-inverse Wishart model. To compare directly to prior work [22], we consider the 
normal-inverse Wishart model 

A ~ Wishart (v, V) 

Xn \p, A-1). 

Here, we use spectral aggregation rules as our function class, restricting to diagonal Wk for com¬ 
putational efficiency. We run two sets of experiments: one using the covariance matrix from Scott 
et al. [22], with N = 5000 data points and d = 5 dimensions, and one using a higher-dimensional 
covariance matrix designed to have a small spectral gap and a range of eigenvalues, with N = 10® 
data points and d = 100 dimensions. In both cases, we use a form of projected SGD, using 40 sam¬ 
ples per iteration to estimate the variational gradients and running 25 iterations of optimization. 
We note that because the mean p is treated as a point-estimated parameter, one could sample A 
exactly using normal-inverse Wishart conjugacy [10]. As Figure 2 shows, ^ VGMC improves both 
first and second posterior moment estimation as compared to the baselines. Here, the greatest gains 
from VGMC appear at large numbers of partitions {K = 50,100). We also note that uniform and 
Gaussian averaging perform similarly because the variances do not differ much across partitions. 

6.3. Mixture of Gaussians. A substantial portion of Bayesian inference focuses on latent vari¬ 
able models and, in particular, mixture models. We therefore evaluate VCMC on the mixture of 
Gaussians model defined by 

di-.L ~ A/" (O, r^/rf) 

Zn ~ Cat (tt) 

Xn\Zn = Z^M (d., , 

where the mixture weights vr and the prior and likelihood variances and are assumed known. 
We use the combinatorial aggregation functions defined in Section 5;wesetL = 8,T = 2,cr = l, 


^Due to space constraints, we compare to the d = 5 experiment of Scott et al. [22] in the supplement. 
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Fig 3; Mixture of Gaussians (d = 8, L = 8). Expectation approximation error for the uniform and 
Gaussian baselines and VGMG. We report the median error, relative to serial MGMG, for cluster 
comembership probabilities of pairs of test data points, for (left) a = 1 and (right) a = 2, where we 
run the VCMG optimization procedure for 50 and 200 iterations, respectively. When cr = 2, some 
comembership probabilities are estimated poorly by all methods; we therefore only use the 70% of 
comembership probabilities with the smallest errors across all the methods. 


and vr uniform and generate N = 5 x 10^ data points in d = 8 dimensions, using the model 
from Nishihara et al. [19]. The resulting inference problem is therefore L x d = 64-dimensional. All 
samples were drawn using the PyStan implementation of Hamiltonian Monte Garlo (HMG). 

As Figure 3 shows, VGMG drastically improves moment estimation compared to the baseline 
Gaussian averaging (9). To assess how VCMG influences estimates in cluster membership prob¬ 
abilities, we generated 100 new test points from the model and analyzed cluster comembership 
probabilities for all pairs in the test set. Concretely, for each Xi and Xj in the test data, we es¬ 
timated P \xi and Xj belong to the same cluster]. Figure 3 shows the resulting boost in accuracy: 
when cr = 1, VCMG delivers estimates close to those of serial MCMC, across all numbers of parti¬ 
tions; the errors are larger for a = 2. Unlike previous models, uniform averaging here outperforms 
Gaussian averaging, and indeed is competitive with VCMG. 

6.4. Assessing computational efficiency. The efficiency of VCMG depends on that of the opti¬ 
mization step, which depends on factors including the step size schedule, number of samples used 
per iteration to estimate gradients, and size of data minibatches used per iteration. Extensively 
assessing the influence of all these factors is beyond the scope of this paper, and is an active area of 
research both in general and specifically in the context of variational inference [13, 17, 21]. Here, we 
provide an initial assessment of the computational efficiency of VGMG, taking the probit regression 
and Gaussian mixture models as our examples, using step sizes and sample numbers from above, 
and eschewing minibatching on data points. 

Figure 4 shows timing results for both models. For the probit regression, while the optimization 
cost is not negligible, it is significantly smaller than that of serial sampling, which takes over 6000 
seconds to produce 1000 effective samples.^ Across most numbers of partitions, approximately 25 
iterations—corresponding to less than 1500 seconds of wall clock time—suffices to give errors close 
to those at convergence. For the mixture, on the other hand, the computational cost of optimization 
is minimal compared to serial sampling. We can see this in the overall speedup of VGMG relative to 
serial MGMG: for sampling and optimization combined, low numbers of partitions {K < 25) achieve 

^We ran the sampler for 5100 iterations, including 100 burnin steps, and kept every fifth sample. 
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Fig 4: Error versus timing and speedup measurements. (Left) VCMC error as a function of number 
of seconds of optimization. The cost of optimization is nonnegligible, but still moderate compared 
to serial MCMC—particularly since our optimization scheme only needs small batches of samples 
and can therefore operate concurrently with the sampler. (Right) Error versus speedup relative to 
serial MCMC, for both CMC with Gaussian averaging (small markers) and VCMC (large markers). 


speedups close to the ideal value of K, and large numbers {K = 50, 100) still achieve good speedups 
of about K/2. The cost of the VCMC optimization step is thus moderate—and, when the MCMC 
step is expensive, small enough to preserve the linear speedup of embarrassingly parallel sampling. 
Moreover, since the serial bottleneck is an optimization, we are optimistic that performance, both in 
terms of number of iterations and wall clock time, can be significantly increased by using techniques 
like data minibatching [9], adaptive step sizes [21], or asynchronous updates [20]. 

7. Conclusion and future work. The flexibility of variational consensus Monte Carlo (VCMC) 
opens several avenues for further research. Eollowing previous work on data-parallel MCMC, we 
used the subposterior factorization. Our variational framework can accomodate more general fac¬ 
torizations that might be more statistically or computationally efficient - e.g. the factorization used 
by Broderick et al. [4]. We also introduced structured sample aggregation, and analyzed some con¬ 
crete instantiations. Complex latent variable models would require more sophisticated aggregation 
functions - e.g. ones that account for symmetries in the model [5] or lift the parameter to a higher 
dimensional space before aggregating. Finally, recall that our algorithm - again following previous 
work - aggregates in a sample-by-sample manner, cf. (4). Other aggregation paradigms may be 
useful in building approximations to multimodal posteriors or in boosting the statistical efficiency 
of the overall sampler. 
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APPENDIX A: PROOFS 


Proof of entropy relaxation. We apply the entropy power inequality [6], which asserts 
that for independent d-dimensional random vectors the sum 

K 

k=l 


satisfies 

( 10 ) 


K 


2h(ip) ^ 

e d > y e d > max e 
l<k<K 


k=l 


where h denotes differential entropy. 
In our case, we have 

and 

Since 

equation (10) implies 


i’k = Fk {9k) 

^ = d = F(di,...,d^) 
H [g] = h (lA), 


H [g] > max h (ipk) = max (H \pk] + [logdet J (Fk) {9k)]) ■ 

l<k<K l<k<K 


Defining 


K 


® Z] [logdet J {Fk) {9k)] + ^ min^H [pfc] , 


k=l 


l<k<K 


we immediately see that 


H[g] >H[g], 


as required. 

Proof of Theorem 4.1. We first define 


□ 


£o (g) = E, [logp{9, X) I 9v.k] = logp{F {9i..k) , X). 


Since C (g) = Ep^.^ [Cq (g)], where the expectation is taken with respect to the subposteriors, which 
do not vary with g, it suffices to show that Cq is concave in each individually for each fixed 9i-k- 
Furthermore, since F {9i-k) is linear in F by the definition of function addition, it actually suffices 
to show ^ {9) = logp {F {9i-k) , X) in each 0“ individually. To see why this holds, first observe that 
for each u G V{G), we have 


( 11 ) 

( 12 ) 


i {9) = log (r) + Yj (^“0 ^ (^“0 

w'Gpar(w) 

+ [(^“)^ {G1 - log 

i;Gch(w) 


+ Cu, 


12 






where is a function of 6 that is constant in By the log-concavity assumption, the sum of 
the first two terms of ^ [6) in (12) is concave in On the other hand, by basic properties of 
exponential families, each log^’^ (0pa'>'(^)) ig convex in 0 p^''O) and hence in making its negative 
concave. Since the remaining terms are linear or constant, t is in fact concave in The claim 
follows. □ 


Proof of Theorem 4.2. Clearly it suffices to show that each Ep^, [logdet J (T),) (0^)] is concave 
and for this it suffices to show that for fixed 9k, logdet J (Fk) (Ok) is concave. This is immediate, 
however, since the Jacobian is a linear function and logdet is a concave function. □ 

APPENDIX B: VARIATIONAL OBJECTIVE EUNCTIONS 

We derive the variational objectives and gradients for the models we analyze. Throughout, we 
make the convention that for A, B ^ 

((A, B)) = Tr{AB) 

denotes the trace inner product. 


B.l. Bayesian probit regression. In this section, we compute the variational objective for 
the Bayesian probit regression model. Eor convenience, we define 

Hk = Ep^ [/3fc] and Sk = Ep^ [PkPk] ■ 

In this notation, the variational objective takes the simple form 


1 


K 




k=l L 


{{Sk, W^Wk)) + 2Y,{{^^k^iJwl, IT,)) 
e^k 


N 


+ E 


Vn ■ E, [log 4'n] + (1 - Vn) ' Eg [log (1 - 4>n)] 


n=l _ 


— ^logdet (IT,) 


k=l 


where 4>„ = 4> (X), {{Wk, hxl))). 
This leads to the gradients 




SkW^ + ^ 

ij^k 
N 


+ ^Eg 




{Un ^n) ) ■ /3fc 




+ 


IT, 


-1 


K ’ 


where we have additionally defined (f)n = 4> {^k=i {{^k, f^kXn))^ and 

K 

P = Y,Wkf3k- 


k=l 
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B.2. Normal-inverse Wishart model. The variational objective for the normal-inverse 
Wishart model takes the form 


£ (W) = E, [Co (M/, Ai:,,)] +R[q], 


where 


K 


^oiW) = --Y, {{Rk {V-^ + X^X) Rl, WkDk)) 


k=l 


N ^ , nr N ^ 


+ y ^ {{Rk + X^^) , lTfc£>A:)) - y {{{Rk^k) {Rk^kf , ITfc-Dfc)) 

■ ' ^ k=l 

2 • log det (e Rl [WkDk] Rk] , 


+ 


k=l 

u+N-d-1 


\k=l 


and we have compressed our notation by setting fj, = J2k ^k^kk, x = -^ Yin Rk = R (Rk), and Dk = 
As before, we have 


1 ^ 

= y J]]logdet(lTfc), 


k=l 


where we have suppressed the constant depending on the since it does not vary with Wk- 
Recalling that Wk is diagonal, we can obtain the gradients by first computing 

Vry,£o (Vb) = Dk • diag [Rk + X^X) R^] 

N _ _ N 

+ ^ ■ Dk {Rk^k OX + Rkx o fi) - — ■ Dk (Rkfk) o (Rk^k) 


ly N — d — 1 
H-?;- Dk ■ diag 


K 


-1 


Rk{Y.^'^i^eDi]Ri] R 


£=1 


where we have used o to denote elementwise vector products. We then find 


VvKfe£ = Eg [Viyfc£o (W)] + 




-1 


K 


B.3. Mixture of Gaussians. Per the description of aggregation in Section 5, we dehne merged 
samples in the mixture of Gaussians model by the equations 


K 


0}=Fat {dl:K,l:L) = Y.Wki9ka,„ 


k=l 


where i = 1,... ,L denotes the cluster index and ak denotes the alignment mapping indices on the 
master core to indices on worker core k. Throughout this section, we treat the alignment variables 
as fixed. 

Using this notation, we define 


L n 


1 ^ 1 


2r2£^"^"2 2 ct2 

i=l £=1 i=l 


— X 


iM2 > 


D{Rk)- 
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where 


and 


_ 

~ : 
z^f'=l Ini' 


= exp ( \\9} -XnWl 


The variational objective then takes the form 

C (IT) = [Co (IT, ev.K,i-.L)] + H [g], 

with the usual equation 

K L 

log det {Wki) ■ 


k=l t=l 

Some calculation then shows that the gradients with respect to the various Wki are given by 

1 ^ 

VklCo (W", ei..K,l-.L) = ^ (1 - lil)m - XnWl ' {0} - Xnf 


n=l 


_ / 1 + I’O 




• Ok. 


nkt 


- Xl) , 


where 


^ f ^ X]n=l Inl 

Xi — \ ^ + 


-1 




N 

E 

n=l 


Jnl 

CJ^ 


• X„,. 


This covers the case of general PSD matrices Wki- When the matrices are restricted to be 
diagonal, we get the simplified gradient 




1 ^ 

{W, ei,K,V.L) = Inl) \\0i - XnWl ' ^ka^i ° {0} “ Xn) 


n=l 


1 

^ I /^n=\ 






Okau ° { 0*1 - xe), 


where o denotes elementwise multiplication of vectors. 

Since 

W~^ 

VkiC (W) = Ep,.^ [VkiC (W, 0i..k,1:l)] + 

this gives us all the information we need to implement an optimization procedure for the objective. 
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APPENDIX C: EXTENDED EMPIRICAL EVALUATION 



Number of cores Number of cores Number of cores 


Eig 5: Eive-dimensional probit regression (d = 5). Moment approximation error for the uniform and 
Gaussian averaging baselines and VCMC, relative to serial MCMC. We assessed three groups of 
functions: (left) first moments, with /(/3) = fdj for 1 < j < d; (center) pure second moments, with 
f{/3) = /3| for I < j < d; and (right) mixed second moments, with /(/3) = jdifdj for 1 < i < j < d. 
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Eig 6: High-dimensional probit regression {d = 300). Moment approximation error for the uniform 
and Gaussian averaging baselines and VCMC, relative to serial MCMC, for subposteriors (left) and 
partial posteriors (right). Here we show the pure second moments. 
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Fig 7: Five-dimensional normal-inverse Wishart model [d = 5). Moment approximation error for 
the uniform and Gaussian averaging baselines and VCMC, relative to serial MCMC. Letting pj 
denote the largest eigenvalue of A“^, we assessed three groups of functions: (left) first moments, 
with /(A) = Pj for 1 < j < d; (center) pure second moments, with /(A) = pj for 1 < j < d; and 
(right) mixed second moments, with /(A) = pipj for 1 < i < j < d. 
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